Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add ujson support in pandas.io.json #3804

Merged
merged 10 commits into from
Jun 11, 2013
Merged

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Jun 7, 2013

This is @wesm PR #3583 with this:

It builds now, and passes travis on py2 and py3, had 2 issues:

  • clean was erasing the *.c files from ujson
  • the module import didn't work because it was using the original init function

Converted to new io API: to_json / read_json

Docs added

@hayd
Copy link
Contributor

hayd commented Jun 7, 2013

yay!

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

This is pretty awesome.

One thing I think worth being explicit in the docs (am I right in saying this?) only works with valid JSON.

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

the json routines read/write from strings
this is unlike any of the other io routines that pandas has
which all take a path_or_buf

is this typical of dealing with JSON data?

should we have a kw to do this? always do it?

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

@jreback That is an excellent point, this should work as all the other read_s. I don't think this is necessarily typical to always have the string but at least it does makes it clear the read_json takes the entire string at once rather than by chunks.

(The first thing I did was open a json_file f.readlines(). f.read())

It'd certainly be a useful feature is we could go pd.from_json(datas_url) and perhaps this would be a fairly standard use case.

We could either:

  1. Have have a string kwarg, filepath_or_buffer is first argument (this would be my preference).
  2. Check if it's a filepath_or_buffer, if not there, treat as json string (seems like a can of worms)

(Also, to clarify previous point, from_json only reads valid json :) )

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

do u have a URL that yields JSON?

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

A more interesting example: https://api.github.com/repos/pydata/pandas/issues?per_page=100

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

parsed first try!

(Pdb) url_table
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 19 columns):
assignee        6  non-null values
body            100  non-null values
closed_at       0  non-null values
comments        100  non-null values
comments_url    100  non-null values
created_at      100  non-null values
events_url      100  non-null values
html_url        100  non-null values
id              100  non-null values
labels          100  non-null values
labels_url      100  non-null values
milestone       75  non-null values
number          100  non-null values
pull_request    100  non-null values
state           100  non-null values
title           100  non-null values
updated_at      100  non-null values
url             100  non-null values
user            100  non-null values
dtypes: int64(3), object(16)

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

The one thing that trips is dates (you just have to to_datetime after), but that can be left for another day.

Whoop! :)

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

yeh....we'll see how this goes....in 0.12 can add infer_types directive, kind of like read_html

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

i wonder if there are any other similar libraries or systems that have this much io functionality in a single package...

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

(Does infer_types work for unix time stamps? ...to get the roundtrip working. Anyway....)

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

doubtful since those are just integers...but i haven't tested

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

i tried

date +"%s" | python -c 'import sys; from dateutil.parser import parse; parse(sys.stdin.read())'

that doesn't work so i'm going to say no it won't work.

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

You can do pd.to_datetime on the column after reading.

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

yes an know....we have an open issue #3540 to make a better API for this, but look at #2015

so if you know they are epoch time stamps (e.g. passing in as an option maybe), then its easy, we can convert them

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

Timestamp accepts a nanosecond based epoch timestamp (e.g. nanoseconds since 1970; so epoch time stamps are in seconds since 1970, so juse int(1e9) and it will work).....

but there is an issue because sometimes they are not in seconds...so have to disambiguate

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

Just saying as to_json exports timestamps to unix time.

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

oh yes Timestamp works...hm maybe should add to read_html...i doubt people are using html tables to store unix timestamp but who the hell knows? maybe i'll wait until the api is sorted out

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

could add a parse_dates arg that takes a list of fields to try to convert?

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

read_html much tougher because people are just reading it in ...and there is no standard...@hayd is right...since its a standard, could even try to convert an integer column (if they are in range)?

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

(obviously far too reckless to just .applymap(pd.to_datetime) lol)

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

what's a quick way to fix inconcistent space/tabs....something got screwed up...

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

M-x untabify

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

on a region i think prolly whole file works too

@jreback
Copy link
Contributor Author

jreback commented Jun 8, 2013

@hayd actually .apply(pd.to_datetime) will not change the column if all conversions fail, so it is safe sort of

try out: convert_objects(convert_dates='coerce'))

@hayd
Copy link
Contributor

hayd commented Jun 8, 2013

Quite slow though?

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

convert_objects is ok speed wise it operates on blocks using cython functions so it's gotta be faster than lambdas :)

@cpcloud
Copy link
Member

cpcloud commented Jun 8, 2013

@jreback correct me if i'm wrong here...

@jreback
Copy link
Contributor Author

jreback commented Jun 11, 2013

@Komnomnomnom go ahead and paste here
and well get this in

@Komnomnomnom
Copy link
Contributor

Ok the following patch should make it safe to call Npy_releaseContext multiple times (which is what was causing the problem). Segmentation fault is gone and valgrind output from Python 2.7 debug build is clean. Likewise all tests pass for Python 2.7 and valgrind output for json tests is clean (i.e. there are no warnings for json related code).

diff --git a/pandas/src/ujson/python/JSONtoObj.c b/pandas/src/ujson/python/JSONtoObj.c
index 1db7586..160c30f 100644
--- a/pandas/src/ujson/python/JSONtoObj.c
+++ b/pandas/src/ujson/python/JSONtoObj.c
@@ -10,6 +10,7 @@ typedef struct __PyObjectDecoder
     JSONObjectDecoder dec;

     void* npyarr;       // Numpy context buffer
+    void* npyarr_addr;  // Ref to npyarr ptr to track DECREF calls
     npy_intp curdim;    // Current array dimension

     PyArray_Descr* dtype;
@@ -67,9 +68,7 @@ void Npy_releaseContext(NpyArrContext* npyarr)
         }
         if (npyarr->dec)
         {
-            // Don't set to null, used to make sure we don't Py_DECREF npyarr
-            // in releaseObject
-            // npyarr->dec->npyarr = NULL;
+            npyarr->dec->npyarr = NULL;
             npyarr->dec->curdim = 0;
         }
         Py_XDECREF(npyarr->labels[0]);
@@ -88,6 +87,7 @@ JSOBJ Object_npyNewArray(void* _decoder)
     {
         // start of array - initialise the context buffer
         npyarr = decoder->npyarr = PyObject_Malloc(sizeof(NpyArrContext));
+        decoder->npyarr_addr = npyarr;

         if (!npyarr)
         {
@@ -515,7 +515,7 @@ JSOBJ Object_newDouble(double value)
 static void Object_releaseObject(JSOBJ obj, void* _decoder)
 {
     PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder;
-    if (obj != decoder->npyarr)
+    if (obj != decoder->npyarr_addr)
     {
         Py_XDECREF( ((PyObject *)obj));
     }
@@ -555,6 +555,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)
     pyDecoder.dec = dec;
     pyDecoder.curdim = 0;
     pyDecoder.npyarr = NULL;
+    pyDecoder.npyarr_addr = NULL;

     decoder = (JSONObjectDecoder*) &pyDecoder;

@@ -609,6 +610,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)

     if (PyErr_Occurred())
     {
+        Npy_releaseContext(pyDecoder.npyarr);
         return NULL;
     }

wesm and others added 10 commits June 11, 2013 10:02
DOC: docs in io.rst/whatsnew/release notes/api

TST: cleaned up cruft in test_series/test_frame
…will return a StringIO object)

     read_json will read from a string-like or filebuf or url (consistent with other parsers)
…or JSON string

     added keywords parse_dates,keep_default_dates to allow for date parsing in columns
     of a Frame (default is False, not to parse dates)
…(which both can be

     can be parsed with parse_dates=True in read_json)
@jreback
Copy link
Contributor Author

jreback commented Jun 11, 2013

patch applied.....looking good now

@hayd
Copy link
Contributor

hayd commented Jun 11, 2013

@jreback Something like this for requests: hayd@dbd968b

@jreback
Copy link
Contributor Author

jreback commented Jun 11, 2013

@wesm this is mergable....any objections?

@wesm
Copy link
Member

wesm commented Jun 11, 2013

Looks good to me, bombs away

@jreback
Copy link
Contributor Author

jreback commented Jun 11, 2013

3.2.1.....

jreback added a commit that referenced this pull request Jun 11, 2013
ENH: add ujson support in pandas.io.json
@jreback jreback merged commit a7f37d4 into pandas-dev:master Jun 11, 2013
@Komnomnomnom
Copy link
Contributor

Awesome. I'll see about merging in upstream changes. Will send thru a pull request soonish.

@jreback
Copy link
Contributor Author

jreback commented Jun 11, 2013

oh...you have additional dependencies on this?

@Komnomnomnom
Copy link
Contributor

Mentioned in #3583, there have been some enhancements / fixes in ultrajson since the pandas json version was originally written. Nothing major (I think) and should be straightforward enough to merge but it'd be a good idea to keep them in sync I think.

@jreback
Copy link
Contributor Author

jreback commented Jun 11, 2013

ok...sure...

@wesm
Copy link
Member

wesm commented Jun 13, 2013

thanks all for making this happen, especially to @Komnomnomnom for authoring this code in the first place =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants